This vignette will demonstrate the integration of CITE-seq and CyTOF protein expression measurements using cyCombine.


First, let us present the two datasets we will integrate:

For the CITE-seq data, we will use the ‘1k Human PBMCs Stained with a Panel of TotalSeq B Antibodies, Dual Indexed’ dataset from https://support.10xgenomics.com/single-cell-gene-expression/datasets. We work on the filtered feature/cell matrix.

For the CyTOF data, we use the data from a single healthy donor processed at the Human Immune Monitoring Center. The sample was derived from FlowRepository https://flowrepository.org/id/FR-FCM-ZYAJ and pre-gated to live intact singlets. The data was subsequently manually gated to populations of interest.

Pre-processing the CITE-seq data

We start by loading some packages

library(cyCombine)
library(tidyverse)
library(Seurat)

Then, we load the Total-seq dataset and preprocess that using Seurat

# Read total-seq data
totalseq_read <- Read10X(data.dir = "filtered_feature_bc_matrix/")

# Create Seurat object, have to set min.cells = 0 to allow addition of ADT information in the next step. 
totalseq <- CreateSeuratObject(counts = totalseq_read$`Gene Expression`, min.cells = 0, min.features = 0, project = "cyCombine")

# Add CITE-seq data
totalseq[["ADT"]] <- CreateAssayObject(counts = totalseq_read$`Antibody Capture`)

# Add MT percentage for cell filtering
totalseq[["percent.mt"]] <- PercentageFeatureSet(object = totalseq, pattern = "^MT-")

# Filter data to exclude cells with very large/small feature counts (high number of different genes expressed) and large mitchondrial content
VlnPlot(object = totalseq, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), pt.size = 0)

totalseq <- subset(totalseq, subset = nFeature_RNA > 200 & nFeature_RNA < 3500 & percent.mt < 10)



Now, we have our Total-seq dataset and we have filtered away some of the more questionable cells. Now, we are technically ready for integration, but here, we will actually also cluster the Total-seq data, so we have a better chance of evaluating the integration performance.

# First, we perform normalization, scaling, dimensionality reduction, and clustering based on the RNA-seq levels
totalseq <- NormalizeData(totalseq, verbose = F) %>% 
  FindVariableFeatures(selection.method = "vst", nfeatures = 2000, verbose = F) %>%
  ScaleData(verbose = F) %>%
  RunPCA(pc.genes = totalseq@var.genes, npcs = 20, verbose = F)

totalseq <- totalseq %>% 
  RunUMAP(dims = 1:20, verbose = F, seed.use = 574) %>% 
  FindNeighbors(dims = 1:20, verbose = F) %>%
  FindClusters(resolution = 0.5, verbose = F)
#> Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
#> To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
#> This message will be shown once per session

# Next, we normalize and scale the CITE-seq portion of the dataset
totalseq <- NormalizeData(totalseq, assay = "ADT", normalization.method = "CLR", verbose = F) %>% 
  ScaleData(verbose = F)

# Now, we alter the rownames of the CITE-seq data slightly, so they're easier to work with
rownames(totalseq@assays$ADT@data) <- gsub('\\.1$', '', rownames(totalseq@assays$ADT@data))
rownames(totalseq@assays$ADT@data) <- gsub('[ _-]', '', rownames(totalseq@assays$ADT@data))


# Let us see the clustering based on RNA levels
DimPlot(totalseq)

# Now, we use the protein and RNA expression levels to label clusters - we do this at a relatively high level 
FeaturePlot(totalseq, features = c('CD3E', 'CD4', 'CD8A', 'CD14', 'CD19', 'NCAM1', 'adt_CD3', 'adt_CD4', 'adt_CD8a', 'adt_CD14', 'adt_CD19', 'adt_CD56'), ncol = 3)

Please note, that we have plotted the corresponding gene/protein pairs in all cases. For this, we can observe that the protein and gene levels correlate, but they are not completely 1:1.




# Giving clusters names
new.cluster.ids <- c("CD4pos", "CD8pos", "CD4pos", 'Myeloid cells', 'B cells', "CD8pos", 'NK cells', "CD4pos")
names(new.cluster.ids) <- levels(totalseq)
totalseq <- RenameIdents(totalseq, new.cluster.ids)

# Visualizing in UMAP with new labels
DimPlot(totalseq)

Now we are ready to extract the CITE-seq data for integration using cyCombine

# Extract CITE-seq data (normalized)
citeseq <- totalseq@assays$ADT@data %>% 
                              t() %>% 
                              as_tibble()

# Add columns to tibble for cyCombine processing
citeseq <- citeseq %>%
  mutate('batch' = 'CITEseq',
         'sample' = 'Sample1',
         'label' = Idents(totalseq)) 

Reading the CyTOF data

Then it is time to read the CyTOF data. In this case, it’s already in a tibble format - but refer to the prepare_data function for this.

### Get HIMC cytof data for one sample
load('Sample001_pregated.Rdata')

Processing datasets together - batch correction

Next, we need to figure out which markers to use when integrating the CITE-seq and CyTOF data. Here, it is important to keep a couple of things in mind.

First, be careful with protein names - i.e. CD197 and CCR7 are actually the same protein, but each name is relatively commonly used. Make sure to correct the column names in either set, so it matches the other, i.e. using colnames(cytof)[colnames(cytof)=='CCR7'] <- 'CD197'.

Second, one should be wary about using CITE-seq proteins with very low or no dynamic range - these do not add any information, and should be removed as they disrupt the analysis. Let us look for such proteins in the 1k PBMC set:

VlnPlot(totalseq, features = get_markers(citeseq), assay = 'ADT', pt.size = 0.01, group.by = 'orig.ident', ncol = 5)
#> Warning in SingleExIPlot(type = type, data = data[, x, drop = FALSE], idents =
#> idents, : All cells have the same value of CD27.
#> Warning in SingleExIPlot(type = type, data = data[, x, drop = FALSE], idents =
#> idents, : All cells have the same value of CD137.
#> Warning in SingleExIPlot(type = type, data = data[, x, drop = FALSE], idents =
#> idents, : All cells have the same value of CD197.
#> Warning in SingleExIPlot(type = type, data = data[, x, drop = FALSE], idents =
#> idents, : All cells have the same value of CD335.



So, now we have an overview of the distributions. Here, we will exclude those where the “red violin” is not visible in the plot.

# Remove columns
citeseq <- citeseq %>%
  select(-c(CD11b, CD20, CD27, CD28, CD34, CD62L, CD69, CD80, CD86, CD137, CD197, CD274, CD278, CD335, HLADR))

Now, we are ready to combine the two datasets on the overlapping columns.

# Get overlapping column names
overlap_cols <- intersect(colnames(citeseq), colnames(cytof))

# Make one tibble
uncorrected <- bind_rows(cytof[,overlap_cols], citeseq[,overlap_cols]) %>%
  mutate(id = 1:(nrow(cytof)+nrow(citeseq)))

And finally, batch correction can be performed with cyCombine.

# Run batch correction
corrected <- uncorrected %>%
  batch_correct(xdim = 8,
                ydim = 8,
                norm_method = 'rank',
                ties.method = 'average')

# Add manually assigned labels back
corrected$label <- uncorrected$label

Evaluating batch correction

We now evaluate the correction using EMD - each marker in each cluster is evaluated

emd <- evaluate_emd(preprocessed = uncorrected,
             corrected = corrected)

emd$violin

emd$scatterplot



Finally, let us look at some plots to visualize the correction. First, the marker distributions before and after:

plot_density(uncorrected, corrected)

Finally, some UMAPs. This function is a little heavy in computation power.

plots <- plot_umap_labels(uncorrected, corrected)

plots[[1]]

plots[[2]]